
Set output_grads correctly #840

Merged
merged 1 commit into from Jul 14, 2023
Conversation

@kwen2501 (Contributor) commented Jul 14, 2023

Issue

In case there are multiple output values and one of them is the loss, a user reported the following error:

```
    output_grads[i] for i in outputs_with_grads_idxs
IndexError: tuple index out of range
...
RuntimeError: 
        Failed to run backward stage stage_backward for stage %submod_7 : [ = call_module[target=submod_7](args = (%submod_6, %_inputs), kwargs = {})
        Stage output: ('Tensor(torch.Size([100, 20, 4096]), grad=False)', 'Tensor(torch.Size([100, 4096]), grad=False)', 'Tensor(torch.Size([100, 4096]), grad=False)', 'Tensor(torch.Size([]), grad=True)', 'Tensor(torch.Size([100]), grad=False)', 'Tensor(torch.Size([100]), grad=False)')
        Output gradient: ('None',)
        Input: ['Tensor(torch.Size([100, 20, 4096]), grad=True)', 'Tensor(torch.Size([100, 20, 4096]), grad=False)', 'Tensor(torch.Size([100]), grad=False)', 'Tensor(torch.Size([100]), grad=False)']
```

Note this part: `Output gradient: ('None',)`

I can reproduce the issue in `local_test_c10d_bwd.py` if I change the output to:

```
-        return {"loss": loss}
+        return {"logits": x, "loss": loss}
```

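The shape mismatch behind the traceback can be sketched in a few lines (index values are illustrative, chosen to mirror the six-output stage in the log above; only the loss at index 3 requires grad):

```python
# A fixed singleton gradient tuple, regardless of how many outputs
# the stage actually produces:
output_grads = (None,)

# The runtime asks for the gradient of output index 3 (the loss),
# which is out of range for a 1-tuple, raising the IndexError above.
outputs_with_grads_idxs = [3]

try:
    grads = tuple(output_grads[i] for i in outputs_with_grads_idxs)
except IndexError as e:
    print(f"IndexError: {e}")
```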
Cause

The above issue is caused by the fixed-length tuple set in the else case:

```
                # (None,) is for `stage_backward` signature
                bwd_kwargs["output_grads"] = (
                    grads if len(grads) > 0 else (None,)
                )
```

This tuple `(None,)` should instead have a length that matches the number of stage outputs.

Fix

Only update `bwd_kwargs["output_grads"]` when we have actually received gradients; otherwise, use the tuple prepared during the IR phase, i.e. `bwd_node.kwargs["output_grads"]`, which may look like `(None, None)` if there are two outputs.
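The fix described above can be sketched as follows. This is an illustrative standalone function, not the actual PiPPy source; the names `grads`, `bwd_kwargs`, and the IR-phase `output_grads` tuple follow the PR text:

```python
def set_output_grads(bwd_kwargs, ir_kwargs, grads):
    """Illustrative sketch: choose output_grads for the backward stage."""
    if len(grads) > 0:
        # Gradients were actually received from downstream stages: use them.
        bwd_kwargs["output_grads"] = grads
    else:
        # No gradients received (e.g. the last stage computes the loss
        # locally): fall back to the tuple prepared during the IR phase,
        # which already has one slot per stage output, e.g. (None, None)
        # for two outputs.
        bwd_kwargs["output_grads"] = ir_kwargs["output_grads"]
    return bwd_kwargs

# With two outputs and no received grads, the fallback keeps arity 2:
kwargs = set_output_grads({}, {"output_grads": (None, None)}, ())
print(kwargs["output_grads"])  # (None, None)
```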

@kwen2501 kwen2501 merged commit e60ebea into main Jul 14, 2023
21 of 25 checks passed